AITopics | acoustic unit

acoustic unit

b6404bf461c3c3186bdf5f55756af908-Paper-Conference.pdf

Neural Information Processing Systems, Feb-16-2026, 17:11:31 GMT

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Poland > Lower Silesia Province > Wroclaw (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units

Yusuf, Bolaji, Černocký, Jan "Honza", Saraçlar, Murat

arXiv.org Artificial Intelligence, Jul-5-2024

End-to-end (E2E) keyword search (KWS) has emerged as an alternative and complementary approach to conventional keyword search, which depends on the output of automatic speech recognition (ASR) systems. While E2E methods greatly simplify the KWS pipeline, they generally have worse performance than their ASR-based counterparts, which can benefit from pretraining with untranscribed data. In this work, we propose a method for pretraining E2E KWS systems with untranscribed data, which involves using acoustic unit discovery (AUD) to obtain discrete units for untranscribed data and then learning to locate sequences of such units in the speech. We conduct experiments across languages and AUD systems: we show that finetuning such a model significantly outperforms a model trained from scratch, and the performance improvements are generally correlated with the quality of the AUD system used for pretraining.
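
The idea of "learning to locate sequences of such units" can be illustrated with a small sketch of how a pretraining example might be built from a discovered unit sequence. This is an assumption for illustration only, not the authors' procedure: the function name `make_pretraining_example`, the random choice of a pseudo-query span, and the exact-match labelling are all hypothetical.

```python
import numpy as np

def make_pretraining_example(units, query_len, rng):
    """Build one (pseudo-query, location labels) pair from discovered units.

    A random span of the unit sequence is used as the pseudo-query, and every
    position covered by an exact occurrence of that span is labelled 1.
    (Hypothetical target construction, not taken from the paper.)
    """
    units = np.asarray(units)
    # Pick the start of the span that will act as the pseudo-query.
    start = int(rng.integers(0, len(units) - query_len + 1))
    query = units[start:start + query_len]

    # Mark every occurrence of the query in the utterance.
    labels = np.zeros(len(units), dtype=np.int64)
    for i in range(len(units) - query_len + 1):
        if np.array_equal(units[i:i + query_len], query):
            labels[i:i + query_len] = 1
    return query, labels

# Toy unit sequence standing in for the output of an AUD system.
rng = np.random.default_rng(0)
units = [3, 3, 7, 1, 4, 7, 1, 9]
query, labels = make_pretraining_example(units, query_len=2, rng=rng)
```

A model pretrained this way would take the speech and the pseudo-query as input and be trained to predict `labels`; finetuning would then replace pseudo-queries with real keywords.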

acoustic unit, query, unit discovery, (13 more...)

arXiv.org Artificial Intelligence

2407.04652

Country:

South America > Colombia > Meta Department > Villavicencio (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)
Asia > Middle East > Republic of Türkiye (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought

Gong, Hongyu, Veluri, Bandhav

arXiv.org Artificial Intelligence, May-30-2024

Expressive speech-to-speech translation (S2ST) is a key research topic in seamless communication, which focuses on the preservation of semantics and speaker vocal style in translated speech. Early works synthesized speaker-style-aligned speech in order to directly learn the mapping from source speech to the target speech spectrogram. Without relying on style-aligned data, recent studies leverage advances in language modeling (LM) and build cascaded LMs on semantic and acoustic tokens. This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST. We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting. The model is first guided to translate the target semantic content and then to transfer the speaker style to multi-stream acoustic units. Evaluated on Spanish-to-English and Hungarian-to-English translation, SeamlessExpressiveLM outperforms cascaded LMs in both semantic quality and style transfer, while also achieving better parameter efficiency.

acoustic unit, speech, translation, (17 more...)

arXiv.org Artificial Intelligence

2405.2041

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Maryland (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Austria > Styria > Graz (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

Peng, Yifan, Kulikov, Ilia, Yang, Yilin, Popuri, Sravya, Lu, Hui, Wang, Changhan, Gong, Hongyu

arXiv.org Artificial Intelligence, Mar-18-2024

There has been growing research interest in, and progress on, speech-to-speech translation (S2ST), the task of translating utterances from one language to another. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model trained in a multitask setting. Without relying on text training data, our model supports multilingual S2ST with the speaker style preserved.

s2st, speech, translation, (13 more...)

arXiv.org Artificial Intelligence

2403.12408

Country:

North America > United States > Maryland (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
North America > Canada > Ontario > Toronto (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Unsupervised Structure Discovery for Semantic Analysis of Audio

Neural Information Processing Systems, Mar-14-2024, 16:46:16 GMT

Approaches to audio classification and retrieval tasks largely rely on detection-based discriminative models. We submit that such models make a simplistic assumption in mapping acoustics directly to semantics, whereas the actual process is likely more complex. We present a generative model that maps acoustics in a hierarchical manner to increasingly higher-level semantics. Our model has two layers with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. We evaluate our model on a large-scale retrieval task from TRECVID 2011, and report significant improvements over standard baselines.

acoustic unit, proceedings, sequence, (15 more...)

Neural Information Processing Systems

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Wang, Yongqi, Bai, Jionghao, Huang, Rongjie, Li, Ruiqi, Hong, Zhiqing, Zhao, Zhou

arXiv.org Artificial Intelligence, Sep-14-2023

Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech during translation. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer between source and target speech. We propose an S2ST framework with an acoustic language model based on discrete units from a self-supervised model and a neural codec for style transfer. The acoustic language model leverages self-supervised in-context learning, acquiring the ability for style transfer without relying on any speaker-parallel data, thereby overcoming the issue of data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/ .

representation, speech, translation, (12 more...)

arXiv.org Artificial Intelligence

2309.07566

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.87)

Direct Text to Speech Translation System using Acoustic Units

Mingote, Victoria, Gimeno, Pablo, Vicente, Luis, Khurana, Sameer, Laurent, Antoine, Duret, Jarod

arXiv.org Artificial Intelligence, Sep-14-2023

This paper proposes a direct text to speech translation system using discrete acoustic units. This framework takes text in different source languages as input and generates speech in the target language without the need for text transcriptions in that language. Motivated by the success of acoustic units in previous works on direct speech to speech translation systems, we use the same pipeline to extract the acoustic units using a speech encoder combined with a clustering algorithm. Once the units are obtained, an encoder-decoder architecture is trained to predict them. A vocoder then generates speech from the units. Our approach for direct text to speech translation was tested on the new CVSS corpus with two different text mBART models employed as initialisation. The systems presented report competitive performance for most of the language pairs evaluated. In addition, results show a remarkable improvement when initialising our proposed architecture with a model pre-trained on more languages.
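
The unit-extraction step described here (a speech encoder followed by a clustering algorithm) can be sketched as below. This is a minimal illustration under stated assumptions: random vectors stand in for encoder frame features, plain k-means is used as the clustering algorithm, and collapsing consecutive repeated units is an optional convention that the abstract does not mention. The function names `kmeans` and `to_units` are hypothetical.

```python
import numpy as np

def kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means over frame-level features, returns the centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids

def to_units(features, centroids, collapse_repeats=True):
    """Map each frame to its nearest centroid index (its acoustic unit)."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    units = dists.argmin(axis=1)
    if collapse_repeats and len(units) > 1:
        # Assumption: keep only the first unit of each run of identical units.
        keep = np.concatenate(([True], units[1:] != units[:-1]))
        units = units[keep]
    return units

# Random vectors stand in for speech-encoder frame features (assumption).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))
centroids = kmeans(frames, k=8)
units = to_units(frames, centroids)
```

In the pipeline the abstract describes, sequences like `units` would serve as the prediction targets of the encoder-decoder model, and a vocoder would turn predicted units back into audio.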

acoustic unit, speech, translation, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/LSP.2023.3313513

2309.07478

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > France (0.04)
Europe > Spain > Aragón > Zaragoza Province > Zaragoza (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

A Textless Metric for Speech-to-Speech Comparison

Besacier, Laurent, Ribeiro, Swen, Galibert, Olivier, Calapodescu, Ioan

arXiv.org Artificial Intelligence, Jul-20-2023

In this paper, we introduce a new and simple method for comparing speech utterances without relying on text transcripts. Our speech-to-speech comparison metric uses state-of-the-art speech2unit encoders like HuBERT to convert speech utterances into discrete acoustic units. We then propose a simple and easily replicable neural architecture that learns a speech-based metric that closely corresponds to its text-based counterpart. This textless metric has numerous potential applications, including evaluating speech-to-speech translation for oral languages and for languages without dependable ASR systems, or avoiding the need for ASR transcription altogether. This paper also shows that for speech-to-speech translation evaluation, ASR-BLEU (which consists of automatically transcribing both the speech hypothesis and the reference and computing sentence-level BLEU between the transcripts) is a poor proxy for real text-BLEU even when the ASR system is strong.

machine learning, natural language, utterance, (21 more...)

arXiv.org Artificial Intelligence

2210.11835

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(5 more...)

Genre: Research Report (0.52)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.90)

DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

Liu, Alexander H., Chang, Heng-Jui, Auli, Michael, Hsu, Wei-Ning, Glass, James R.

arXiv.org Artificial Intelligence, May-17-2023

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR) which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units. The source code will be made available after the anonymity period.
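
The online clustering step (teacher embeddings assigned to a codebook that is updated as batches arrive) can be sketched as follows. The abstract does not state the update rule, so the exponential-moving-average update used here is an assumption, as are the class name `OnlineCodebook`, the `decay` parameter, and the random stand-in embeddings.

```python
import numpy as np

class OnlineCodebook:
    """Codebook updated batch by batch with an exponential moving average.

    Hypothetical sketch: the EMA rule is one common form of online
    clustering and is not claimed to be DinoSR's exact update.
    """

    def __init__(self, num_codes, dim, decay=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.codes = rng.normal(size=(num_codes, dim))
        self.decay = decay

    def assign(self, embeddings):
        # Index of the nearest code for every embedding (squared Euclidean).
        dists = ((embeddings[:, None, :] - self.codes[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    def update(self, embeddings):
        # Assign first, then move each used code toward its batch mean.
        ids = self.assign(embeddings)
        for j in np.unique(ids):
            batch_mean = embeddings[ids == j].mean(axis=0)
            self.codes[j] = self.decay * self.codes[j] + (1 - self.decay) * batch_mean
        return ids

# Random vectors stand in for teacher-network embeddings (assumption).
codebook = OnlineCodebook(num_codes=4, dim=8)
rng = np.random.default_rng(1)
for _ in range(10):
    batch = rng.normal(size=(32, 8))
    targets = codebook.update(batch)
```

The returned `targets` play the role of the discretized tokens that, per the abstract, guide the student network.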

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2305.10005

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Poland > Lower Silesia Province > Wroclaw (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
